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Abstract 



CSJ The problem of incomplete data — i.e., data with missing or unknown values — in 

^~~^ multi-way arrays is ubiquitous in biomedical signal processing, network traffic 

^_^ analysis, bibliometrics, social network analysis, chemometrics, computer vision, 

.^ communication networks, etc. We consider the problem of how to factorize 

h^- data sets with missing values with the goal of capturing the underlying latent 

^^ structure of the data and possibly reconstructing missing values (i.e., tensor 

1-^ completion). We focus on one of the most well-known tensor factorizations that 

t^ captures multi-hnear structure, CANDECOMP/PARAFAC (CP). In the pres- 

Ch ence of missing data, CP can be formulated as a weighted least squares problem 

i__i that models only the known entries. We develop an algorithm called CP-WOPT 

(CP Weighted OPTimization) that uses a first-order optimization approach to 
solve the weighted least squares problem. Based on extensive numerical exper- 
iments, our algorithm is shown to successfully factorize tensors with noise and 

^>^ up to 99% missing data. A unique aspect of our approach is that it scales to 

^— I sparse large-scale data, e.g., 1000 x 1000 x 1000 with five million known entries 

Cn (0.5% dense). We further demonstrate the usefulness of CP-WOPT on two 

l/^ real-world applications: a novel EEG (electroencephalogram) application where 

^^ missing data is frequently encountered due to disconnections of electrodes and 

^—^ the problem of modeling computer network traffic where data may be absent 

. . due to the expense of the data collection process. 
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1. Introduction 

Missing data can arise in a variety of settings due to loss of information, 
errors in the data collection process, or costly experiments. For instance, in 
biomedical signal processing, missing data can be encountered during EEG anal- 
ysis, where multiple electrodes are used to collect the electrical activity along 
the scalp. If one of the electrodes becomes loose or disconnected, the signal is 
either lost or discarded due to contamination with high amounts of mechanical 
noise. We also encounter the missing data problem in other areas of data min- 
ing, such as packet losses in network traffic analysis [2] and occlusions in images 
in computer vision [3]. Many real- world data with missing entries are ignored 
because they are deemed unsuitable for analysis, but this work contributes to 
the growing evidence that such data can be analyzed. 

Unlike most previous studies on missing data which have only considered 
matrices, we focus here on the problem of missing data in tensors because it 
has been shown increasingly that data often have more than two modes of vari- 
ation and are therefore best represented as multi-way arrays (i.e., tensors) [4, 5]. 
For instance, in EEG data each signal from an electrode can be represented as a 
time-frequency matrix; thus, data from multiple channels is three-dimensional 
(temporal, spectral, and spatial) and forms a three-way array [6]. Social network 
data, network traffic data, and bibliomctric data are of interest to many appli- 
cations such as community detection, link mining, and more; these data can 
have multiple dimensions/modalities, are often extremely large, and generally 
have at least some missing data. These are just a few of the many data analysis 
applications where one needs to deal with large multi-way arrays with missing 
entries. Other examples of multi-way arrays with missing entries from different 
disciplines have also been studied in the literature [7, 8, 9]. For instance, [7] 
shows that, in spectroscopy, intermittent machine failures or different sampling 
frequencies may result in tensors with missing fibers (i.e., the higher-order ana- 
logues of matrix rows or columns, see Figure 1). Similarly, missing fibers are 
encountered in multidimensional NMR (Nuclear Magnetic Resonance) analysis, 
where sparse sampling is used in order to reduce the experimental time [8]. 




Figure 1: A 3- way tensor with missing row fibers (in gray). 



National Nuclear Security Administration under Contract DE-AC04-94AL85000. 



X 



Ci C2 



^ n 

+ 



^ 



'ii 



Uai 



a2 



ai? 



Figure 2: Illustration of an i?-component CP model for a third-order tensor OC. 



Our goal is to capture the latent structure of the data via a higher-order 
factorization, even in the presence of missing data. Handling missing data in 
the context of matrix factorizations, e.g., the widely-used principal component 
analysis, has long been studied [10, 11] (see [3] for a review). It is also closely 
related to the matrix completion problem, where the goal is to recover the 
missing entries [12, 13] (see §3 for more discussion). Higher-order factorizations, 
i.e., tensor factorizations, have emerged as an important method for information 
analysis [4, 5]. Instead of flattening (unfolding) multi-way arrays into matrices 
and using matrix factorization techniques, tensor models preserve the multi-way 
nature of the data and extract the underlying factors in each mode (dimension) 
of a higher-order array. 

We focus here on the CANDECOMP/PARAFAC (CP) tensor decomposition 
[14, 15], which is a tensor model commonly used in various applications [6, 16, 
17, 18, 19]. To illustrate differences between matrix and tensor factorizations, 
we introduce the CP decomposition for three-way tensors; discussion of the CP 
decomposition for general N-way tensors can be found in §4. Let X be a three- 
way tensor of size I x J x K, and assume its rank is R (see [5] for a detailed 
discussion on tensor rank) . With perfect data, the CP decomposition is defined 
by factor matrices A, B, and C of sizes I x R, J x R, and K x R, respectively, 
such that 

Ft. 

^ijk = ^ airbjrCkr, for aU i = 1, . . . , /, j = 1, . . . , J, k ^ I,.. .,K. 

r=l 

In the presence of noise, the true % is not observable and we cannot expect 
equality. Instead, the CP decomposition should minimize the error function 
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An illustration of CP for third-order tensors is given in Figure 2. The CP 
decomposition is extensible to A^-way tensors for A^ > 3, and there are numerous 
methods for computing it [20]. 

In the case of incomplete data, a standard practice is to impute the missing 
values in some fashion (e.g., replacing the missing entries using average values 
along a particular mode). Imputation can be useful as long as the amount 



of missing data is small; however, performance degrades for large amounts of 
missing data [10, 1]. As a better alternative, factorizations of the data with 
imputed values for missing entries can be used to rc-impute the missing values 
and the procedure can be repeated to iteratively determine suitable values for 
the missing entries. Such a procedure is an example of the expectation max- 
imization (EM) algorithm [21]. Computing CP decompositions by combining 
the alternating least squares method, which computes the factor matrices one 
at a time, and iterative imputation (denoted EM-ALS in this paper) has been 
shown to be quite effective and has the advantage of often being simple and fast. 
Nevertheless, as the amount of missing data increases, the performance of the 
algorithm may suffer since the initialization and the intermediate models used 
to impute the missing values will increase the risk of converging to a less than 
optimal factorization [7]. Also, the poor convergence of alternating methods 
due to their vulnerability to flatlining, i.e., stagnation, is noted in [3]. 

In this paper, though, we focus on using a weighted version of the error 
function to ignore missing data and model only the known entries. In that case, 
nonlinear optimization can be used to directly solve the weighted least squares 
problem for the CP model. The weighted version of (1) is 
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/w(A, B, C) = - ^ ^ ^ <^ w,jk I Xijk - Y^ I 

i=l j = l k=l I \ r=l 
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where W, which is the same size as DC, is a nonnegative weight tensor defined 
as 



I 1 if Xijk is known, 
Wijk = <^ .^ . . . for all z = 1,...,/, j = 1,..., J, /c = 1,...,X. 

I U it Xijk is missing. 

Our contributions in this paper are summarized as follows, (a) We develop 
a scalable algorithm called CP-WOPT (CP Weighted OPTimization) for tensor 
factorizations in the presence of missing data. CP-WOPT uses first-order opti- 
mization to solve the weighted least squares objective function over all the fac- 
tor matrices simultaneously, (b) We show that CP-WOPT can scale to sparse, 
large-scale data using specialized sparse data structures, significantly reducing 
the storage and computation costs, (c) Using extensive numerical experiments 
on simulated data sets, we show that CP-WOPT can successfully factor ten- 
sors with noise and up to 99% missing data. In many cases, CP-WOPT is 
significantly faster than the best published direct optimization method in the 
literature [7] . (d) We demonstrate the applicability of the proposed algorithm 
on a real data set in a novel EEC application where data is incomplete due to 
failures of particular electrodes. This is a common occurrence in practice, and 
our experiments show that even if signals from almost half of the channels are 
missing, underlying brain activities can still be captured using the CP-WOPT 
algorithm, illustrating the usefulness of our proposed method, (e) In addition 
to tensor factorizations, we also show that CP-WOPT can be used to address 
the tensor completion problem in the context of network traffic analysis. We 



use the factors captured by the CP-WOPT algorithm to reconstruct the tensor 
and illustrate that even if there is a large amount of missing data, the algorithm 
is able to keep the relative error in the missing entries close to the modeling 
error. 

The paper is organized as follows. We introduce the notation used through- 
out the paper in §2. In §3, we discuss related work in matrix and tensor factor- 
izations. The computation of the function and gradient values for the general 
A^-way weighted version of the error function and the presentation of the CP- 
WOPT method are given in §4. Numerical results on both simulated and real 
data are given in §5. Conclusions and future work are discussed in §6. 

2. Notation 

Tensors of order A^ > 3 are denoted by Euler script letters (X,'y,23), ma- 
trices are denoted by boldface capital letters (A, B, C), vectors are denoted by 
boldface lowercase letters (a, b, c), and scalars are denoted by lowercase letters 
(a, b, c). Columns of a matrix are denoted by boldface lower letters with a sub- 
script (ai , a2 , a3 arc first three columns of A) . Entries of a matrix or a tensor 
are denoted by lowercase letters with subscripts, i.e., the {ii,i2, ■ ■ ■ ,iN) entry 
of an N-way tensor % is denoted by a:;^^^^...^^. 

An TV-way tensor can be rearranged as a matrix; this is called matricization, 
also known as unfolding or flattening. The mode-n matricization of a tensor 
% e M^i^^^^'"^^™ is denoted by X(„) and arranges the mode-n one-dimensional 
"fibers" to be the columns of the resulting matrix; see [1, 5] for details. 

Given two tensors X and ^ of equal size Ii x I2 x ■ ■ ■ x Ij^, their Hadamard 
(elementwise) product is denoted by X * ^ and defined as 

(X*y).^.^...^^ =a;jji,...,„y,j,2...i„ for aU z„ e {!,...,/„} and ne {1,...,N} 

The inner product of two same-sized tensors X,y E ]fjJix^2x-x/jv jg ^j^g g^,^ qJ 
the products of their entries, i.e.. 
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For a tensor "X of size Ii x I2 x ■ ■ ■ x Ijy, its norm is || X || = ■\/( X, X). For 
matrices and vectors, || • || refers to the analogous Frobenius and two- norm, 
respectively. We can also define a weighted norm as follows. Let X and W be 
two tensors of size Ii x I2 x ■ ■ ■ x I^. Then the W- weighted norm of X is 



X||^ = ||W*X| 



Given a sequence of matrices A^"' of size !„ x R for n = 1, ... ,7V, the 
notation |A^"'^^ , A^^' , . . . , A*- '] defines an /i x /2 x • • • x /^r tensor whose elements 
are given by 
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For just two matrices, this reduces to familiar expressions: JA, B] = AB . 
Using the notation defined here, (2) can be rewritten as 

/w(A,B,C) = i||X-IA,B,Cl||^. 

3. Related Work in Factorizations with Missing Data 

In this section, we first review the approaches for handling missing data in 
matrix factorizations and then discuss how these techniques have been extended 
to tensor factorizations. 

3.1. Matrix Factorizations 

Matrix factorization in the presence of missing entries is a problem that has 
been studied for several decades; see, e.g., [10, 11]. The problem is typically 
formulated analogously to (2) as 



/w(A,B) = :^ X-ABT ' . (3) 



A common procedure for solving this problem is EM which combines imputation 
and alternation [7, 22]. In this approach, the missing values of X are imputed 
using the current model, X = AB , as follows: 

X = W*X + (1- W)*X, 

where 1 is the matrix of all ones. Once X is generated, the matrices A and B 
can then be alternatingly updated according to the error function i|jX — AB |p 
(e.g., using the linear least squares method). See [22, 23] for further discussion 
of the EM method in the missing data and general weighted case. 

Recently, a direct nonlinear optimization approach was proposed for matrix 
factorization with missing data [3]. In this case, (3) is solved directly using a 
2nd-order damped Newton method. This new method is compared to other 
standard techniques based on some form of alternation and/or imputation as 
well as hybrid techniques that combine both approaches. Overall, the conclusion 
is that nonlinear optimization strategies are key to successful matrix factoriza- 
tion. Moreover, the authors observe that the alternating methods tend to take 
much longer to converge to the solution even though they make faster progress 
initially. This work is theoretically the most related to what we propose — the 
main differences are 1) we focus on tensors rather than matrices, and 2) we use 
first-order rather than second-order optimization methods (we note that first 
order methods are mentioned as future work in [3]). 

A major difference between matrix and tensor factorizations is worth noting 
here. In [22, 3], the lack of uniqueness in matrix decompositions is discussed. 
Given any invertible matrix G, [JA,B] = [JAG,BG^ ]. This means that there 
is an infinite family of equivalent solutions to (3). In [3], regularization is rec- 
ommended as a partial solution; however regularization can only control scaling 



indeterminacies and not rotational freedom. In the case of the CP model, of- 
ten a unique solution (including trivial indeterminacies of scaling and column 
permutation) can be recovered exactly; see, e.g., [5] for further discussion on 
uniqueness of the CP decomposition. 

Factorization of matrices with missing entries is also closely related to the 
matrix completion problem. In matrix completion, one tries to recover the miss- 
ing matrix entries using the low-rank structure of the matrix. Recent work in 
this area [12, 13] shows that even if a small amount of matrix entries are avail- 
able and those are corrupted with noise, it is still possible to recover the missing 
entries up to the level of noise. In [13], it is also discussed how this problem 
relates to the field of compressive sensing, which exploits structures in data 
to generate more compact representations of the data. Practically speaking, 
the difference between completion and factorization is how success is measured. 
Factorization methods aim to increase accuracy in the factors; in other words, 
capture the underlying phenomena as well as possible. Completion methods, 
on the other hand, seek accuracy in filling in the missing data. Obviously, once 
a factorization has been computed, it can be used to reconstruct the missing 
entries. In fact, many completion methods use this procedure. 

3.2. Tensor Factorizations 

The EM procedure discussed for matrices has also been widely employed for 
tensor factorizations with missing data. If the current model is [[A, B, C], then 
we fill in the missing entries of % to produce a complete tensor according to 

X = W * X + (1 - W) * lA, B, C], 

where 1 is the tensor of all ones the same size as W. The factor matrices are 
then updated using alternating least squares (ALS) as those that best fit X. See, 
e.g., [24, 25] for further details. As noted previously, we denote this method as 
EM-ALS. 

Paatero [26] and Toniasi and Bro [7] have investigated direct nonlinear ap- 
proaches based on Gauss-Newton (GN). The code from [26] is not widely avail- 
able; therefore, we focus on [7] and its INDAFAC (INcompletc DAta paraFAC) 
procedure which specifically uses the Levenberg-Marquardt version of GN for 
fitting the CP model to data with missing entries. The primary application in 
[7] is missing data in chcmometrics experiments. This approach is compared 
to EM-ALS with the resuh being that INDAFAC and EM-ALS perform almost 
equally well in general with the exception that INDAFAC is more accurate for 
difficult problems, i.e., higher collinearity and systematically missing patterns 
of data. In terms of computational efficiency, EM-ALS is usually faster but 
becomes slower than INDAFAC as the percentage of missing entries increases 
and also depending on the missing entry patterns. 

Both INDAFAC and CP-WOPT address the problem of fitting the CP model 
to incomplete data sets by solving (2). The difference is that INDAFAC is 
based on second-order optimization while CP-WOPT is first-order with a goal 
of scaling to larger problem sizes. 



4. CP-WOPT Algorithm 

We consider tlie general TV-way CP factorization problem for tensors with 
missing entries. Let !X be a real- valued tensor of size Ii x I2 x ■ ■ ■ x I^ and 
assume its rank is known to be R.^ Define a nonnegative weight tensor TV of 
the same size as % such that 

_ J 1 if Xi^i^...i^ is known, 
1^0 it Xi^i^...ij^ IS missmg, 

for all i„ e {1, . . . , /„} and n G {1, . . . , N}. (4) 

The TV-way objective function is defined by 

/w(AW,A(2),...,AW)^i||(X-IA«,...,AWl)||^. (5) 

It is a function of P = R'l2n=i ^n variables. Due to the well-known indetermi- 
nacies of the CP model, it may also be desirable to add regularization to the 
objective function as in [20], but this has not been necessary thus far in our 
experiments. 

Equation (5) is equivalent to 

/w(A«,A(2),...,AW) = i||l^-2,|r, 

where 1^ = W * X and 2, = W * ^A^^', . . . , A^^']. (6) 

The tensor y can be precomputed as neither W nor % change during the iter- 
ations. We will sec later that %, can be computed efficiently for sparse W. 

Our goal is to find matrices A*^"' G M^"^-'^ for n = 1, . . . , A^ that minimize the 
weighted objective function in (5). We show how to compute the function and 
gradient values efficiently when W is dense in §4.1 and when W is sparse in §4.2. 
Once the function and gradient are known, any gradient-based optimization 
method [27] can be used to solve the optimization problem. 

The derivation of the gradient in the weighted case is given in [1] ; here we 
just report the formula. In matrix notation, using y and Z from (6), we can 
rewrite the gradient equation as 

^•^J^ = (Z(„)-Y(„))A(-"), (7) 



where 



a(-") = aW ... a("+i) a("-i) • • • a(i) 



for n = I, . . . ,N. The symbol denotes the Khatri-Rao product; see, e.g., [1, 5] 
for details. 



^In practice, the rank is generally not known and is not easily determined. Understanding 
the performance of the methods under consideration in that scenario is a topic of future work. 
Results in [20] indicate that direct optimization methods have an advantage over alternating 
least squares approaches when the rank is overestimated. 



4-.1. Computations with Dense W 

Figure 3a shows the algorithmic steps for computing the function and gradi- 
ent values. We are optimizing a function of P variables as defined by the factor 
matrices A*-^-* through A*^ •*. The gradient is computed as a series of matrices 
G^"-* = „ ■•[„) for n = 1, . . . , iV. Recall that T(„) in the last step is the unfolding 
of the tensor T in mode n. 

4-2. Computations with Sparse W 

If a large number of entries is missing, then W is sparse. In this case, there 
is no need to allocate storage for every entry of the tensor "X. Instead, we can 
store and work with just the known values, making the method efficient in both 
storage and time. Figure 3b shows analogous computations to Figure 3a in the 
case that W is sparse. 

Let the ordered set X = {i^^-*,i^^', . . . , i*-*^'} be the indices of the known 
values, i.e., all the locations where TV = 1. Each i^'^-', q € {1,...,Q}, is an 
N-tuple of indices whose nth entry is denoted by z„ . The known entries of X 
can be stored in an array y of length Q so that 

Vq^XMA.,) ,(9) for q=l,...,Q. 

Observe that ||W*X|| = ||y|| • Thus, the value of 7 is the same in both 
Figure 3a and Figure 3b. 

In the dense version (Figure 3a), we need to compute % = 'W*|A'-^', . . . , A*- '] 
In the sparse version, we observe that % is necessarily zero anywhere there is a 
missing entry in %. Consequently, it is only necessary to compute the entries of 
Z corresponding to known values in I; the vector z corresponds to these known 
entries. The computations for Step 2 in Figure 3b can be done efficiently in 
MATLAB using the "expanded" vectors technique of [28, §3.2.4]. Let the rth 
summand be denoted by u, i.e., 

N 
n=l 

Let v^") be "expanded" vectors of length Q for n == 1, . . . , TV defined by 

<'=a(rJ, for g=l,...,g. (8) 

tq r 

Then the vector u can be calculated as 

This can naturally be done iteratively (i.e., only one v'") is calculated at a time) 
to reduce storage costs. Likewise, each summand u can be iteratively added to 
compute the final answer z. 

In Step 3, the function values in Figure 3a and Figure 3b arc clearly equiva- 
lent since y and z contain the nonzero values of 'y and Z., respectively. Similarly, 



1. 


Assume y = W * X and 7 = 



■y are precomputed. 




2. 


Compute Z = W* JA^^^, . . 


,aW1. 




3. 


Com.pute function value: / = 


= i7-(y,2,) + i||2,||^ 




4. 


Compute 7 = y — %. 






5. 


Compute gradient matrices: 


G(") = -T(„)A(-") for n = 1, . 


.,iV. 



(a) Dense W. 



1. 


Let Z = {v^' ,'v'^' , . . . , i'*^'} be an ordered set of all the locations where 
W = 1, i.e., the indices of the known values. Let y be the Icngth-Q 
vector of the values of X at the locations indicated by X. Assume y and 
7 = y are precomputed. 


2. 


Compute the Q-vector z as 




R N 

^9 = ) !il«?,v fo'^ g = i,...,0. 

r=ln=l 


3. 


Compute function value: / = ^7 — y'''z + ^ || z |j . 


4. 


Compute t = y — z. 


5. 


Compute gradient matrices G^"' for n = 1, . . . , A^ as follows: 




Q f N \ 

,:4'"=. V "^" / 



(b) Sparse W. 

Figure 3: CP-WOPT computation of function value and gradient. 
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the vector t in Step 4 of Figure 3b represents just the nonzero values of T in 
Figure 3a. 

The computation of the gradients in Step 6 of Figure 3b performs a matricized- 
tensor-times-Khatri- Rao-product (mttkrp) calculation, which has been described 
for the sparse data case in [28, §5.2.6]. Here we briefly summarize the method- 
ology. The rth column of 0^"% gr is calculated as follows. Let the vectors 
v^"^ be defined as above in (8), but define u instead as 

U = t * v(l) * v(2) * • • • v("-l) * v("+l) * • • • * v(^). 

Then 



(g(")) - Y. "« f°'' J = l,----^n- 






This can be computed efficiently using the accumarray function in MATLAB, 
and the code is available in version 2.4 of the Tensor Toolbox for MATLAB [29]. 

5. Experiments 

On both real and simulated three-way data, we assess the performance of 
the CP-WOPT method. We compare CP-WOPT with other methods and also 
demonstrate its performance on two applications. 

5.1. Computational environment 

All experiments were performed using MATLAB 2009b on a Linux Worksta- 
tion (RedHat 5.2) with 2 Quad-Core Intel Xeon 3.0GHz processors and 32GB 
RAM. Timings were performed using MATLAB's tic and toe functions since 
cputime is known to produce inaccurate results for multi-CPU and/or multi- 
core systems. 

CP-WOPT is implemented in the Tensor Toolbox [29]. We consider dense 
and sparse versions, based on the gradient and function computations shown in 
Figure 3a and Figure 3b, respectively. We use the nonlinear conjugate gradient 
(NCG) method with Hestenes-Stiefel updates [27] and the More-Thuente line 
search [30] provided in the Poblano Toolbox [31] as the optimization method. 

We compare CP-WOPT to two other methods: EM-ALS (implemented in 
the N-way Toolbox for MATLAB, version 3.10 [32]) and INDAFAC [33], which is 
a damped Gauss-Newton method proposed by Tomasi and Bro [7] . Previously, 
Tomasi and Bro showed that INDAFAC converged to solutions in many fewer 
iterations than EM-ALS. 

The stopping conditions are set as follows. All algorithms use the relative 
change in the function value /w in (2) as a stopping condition (set to 10~^). In 
INDAFAC, the tolerance on the infinity norm of the gradient is set to 10^® and 
the maximum number of iterations is set to 500. In CP-WOPT, the tolerance on 
the two-norm of the gradient divided by the number of entries in the gradient is 
set to 10~*, the maximum number of iterations is set to 500, and the maximum 
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number of function evaluations is set to 10000. In EM-ALS, the maximum 
number of iterations (equivalent to one function evaluation) is set to 10000. 

All the methods under consideration are iterative methods. We used mul- 
tiple starting points for each randomly generated problem. The first starting 
point is generated using the n-mode singular vectors "^ of X with missing entries 
replaced by zero. In our preliminary experiments, this starting procedure pro- 
duced significantly better results than random initializations, even with large 
amounts of missing data. In order to improve the chances of reaching the global 
minimum, additional starting points were generated randomly. The same set of 
starting points was used by all the methods in the same order. 

5.2. Validation metrics 

If the true factors are known, then we can assess the recovery of the fac- 
tors via the factor match score (FMS) defined as follows. Let the correct and 
computed factorizations be given by 

^A.a«oa(2)o...oaW and ^ A. a^ o 4^) „ . . . „ ^W^ 

r— 1 r— 1 

respectively. Without loss of generality, we assume that all the vectors have 
been scaled to unit length and that the scalars are positive. Further, we assume 
R > R so that the computed solution has at least as many components as the 
true solution. (One could add all-zero components to the computed solution if 
R < R.) Recall that there is a permutation ambiguity, so all possible matchings 
of components between the two solutions must be considered. Under these 
conditions, the FMS is defined as 

FMS= max 1 V fl - '\" V^M TT |a(")V"\|. (9) 

li R = R, then the set n(i?, R) is all permutations of 1 to R; otherwise, it is 
all possible permutations of all (^) mappings of {1, ... , R} to {1, ... , R}. The 
FMS can be between and 1, and the best possible FMS is 1. If ^ > i?, some 
components in the computed solution are completely ignored in the calculation 
of FMS, but this is not an issue for us because we use R = R throughout. 

We also consider the problem of recovering missing data. Let % be the 
original data and let % be the tensor that is produced by the computed model. 
Then the tensor completion score (TCS) is 

^^^- ||(i-w)*x|| • ^^°^ 

In other words, the TCS is the relative error in the missing entries. TCS is 
always nonnegative, and the best possible score is 0. 



^The n-mode singular vectors are the left singular vectors of X(„) (X unfolded in mode n). 
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5.3. Simulated data 

We consider the performance of the methods on moderately-sized problems 
of sizes 50 X 40 X 30, 100 x 80 x 60, and 150 x 120 x 90. For all sizes, we set 
the number of components in the CP model to be i? = 5. We test 60%, 70%, 
80%, 90%, and 95% missing data. The experiments show that the underlying 
factors can be captured even if the CP model is fit to a tensor with a significant 
amount of missing data; this is because the low-rank structure of the tensor 
is being exploited. A rank-i? CP model for a tensor of size I x J x K has 
R{I -|-J-|-i^— l)-fl degrees of freedom. The reason that the factors can be 
recovered accurately, even with 95% missing data, is that there is still a lot 
more data than variables, i.e., the size of the data is equal to 0.05 UK which 
is much greater than the R{I -|-J-|-i^— 1)-|-1 variables for large values of /, 
J, and K. Because it is a nonlinear problem, we do not know exactly how many 
data entries are needed in order to recover a CP model of a low-rank tensor; 
however, a lower bound for the number of entries needed in the matrix case has 
been derived in [12]. 

5.3.1. Generating simulated data 

We create the test problems as follows. Assume that the tensor size is 
I X J X K and that the number of factors is R. We generate factor matrices A, 
B, and C of sizes I x R, J x R, and K x R, respectively, by randomly choosing 
each entry from M{0, 1) and then normalizing every column to unit length. Note 
that this method of generating the factor matrices ensures that the solution is 
unique with probability one for the sizes and number of components that we 
are considering because R <C niin{/, J, K}.'^ 

We then create the data tensor as 

X=IA,B,Cl+r7^X 

Here ^Sf is a noise tensor (of the same size as X) with entries randomly selected 
from A/'(0, 1), and rj is the noise parameter. We use rj = 10% in our experiments. 
Finally, we set some entries of each generated tensor to be missing accord- 
ing to an indicator tensor IV. We consider two situations for determining the 
missing data: randomly missing entries and, in Appendix B, structured missing 
data in the form of randomly missing fibers. In the case of randomly missing 
entries, IV is a binary tensor such that exactly [Af/Ji^J randomly selected en- 
tries are set to zero, where M e (0, 1) defines the percentage of missing data. 
We require that every slice of W (in every direction) have at least one nonzero 
because otherwise we have no chance of recovering the factor matrices; this is 
related to the problem of coherence in the matrix completion problem where it 
is well-known that missing an entire row (or a column) of a matrix means that 



*Since each matrix has R linearly independent columns with probability one and the fc-rank 
(i.e., the maximum value k such that any k columns are linearly independent) of each factor 
matrix is _R, we satisfy the necessary conditions defined by Kruskal [34] for uniqueness. 
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the true factors can never be recovered. For each problem size and missing data 
percentage, thirty independent test problems are generated. 

5.3.2. Results 

In our experiments, CP-WOPT, INDAFAC, and EM-ALS were indistin- 
guishable in terms of accuracy. Figure 4 show box plots^ of the FMS scores for 
problems of size 50 x 40 x 30. Results for problems of size 100 x 80 x 60 and 
150 X 120 X 90 are provided in Appendix A. The plots summarize the FMS 
scores of the computed factors for 30 independent problems for each problem 
size and percentage of missing data: the median values are shown as black dots 
inside a colored circle, the 25th and 75th percentile values are indicated by the 
top and bottom edges of the solid bars extending from the median values, and 
the outliers (corresponding to values more than 2.7 standard deviations from 
the means) are shown as smaller colored circles. A lack of solid bars extending 
from the median indicates that the 25th and 75th percentile values are equal to 
the median. The results shown are cumulative (i.e., the best so far) across the 
multiple starts, with the first start using the n-mode singular vectors and the 
remaining starts using random vectors. 

For all methods, additional starting points improve performance; however, 
no one method clearly requires fewer starting points than the others. In general, 
using two or three starting points suffices to solve all problems to high accuracy 
in terms of FMS. For problems with up to 90% missing data, this is illustrated 
by median values which are all very close to one; further, the number of outliers 
decreases with multiple starts. 

Perhaps contrary to intuition, we note that smaller problems are generally 
more difficult than larger problems for the same percentage of missing data. 
Thus, we can see in the bottom plot of Figure 4 that the FMS scores remain 
low even with multiple starts for problems of size 50 x 40 x 30 with 95% missing 
data. It may be that some of these problems have gone below the lower bound 
on the amount of data needed to find a solution; as mentioned previously, a 
lower bound on how much data is needed is known for the matrix case [12] but 
finding such a bound is still an open problem for higher-order tensors. We can 
define the ratio p as 

Number of known tensor entries (1 — M)IJK 

^~ Number of variables ~ i?(/ + J + X - 1) + 1 ' ^ ' 

where smaller values of p indicate more difficult problems. For example, prob- 
lems with 95% missing data of size 50 x 40 x 30 {p « 5) are more difficult than 
those of size 100 x 80 x 60 (p « 20), as illustrated in Figure 4 and Figure A. 10, 
respectively. In the latter figure, corresponding to the larger size (and thus 



*The box plots were generated using the boxplot command in the Statistics Toolbox of 
Matlab. The plots shown here use the "compact" plot style for that command. For more 
details on box plots and the use of this Matlab command, see http://www.mathworks.com/ 
access/helpdesk/help/toolbox/stats/boxplot .html. 
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Figure 4: Comparison of accuracy for problems of size 50 x 40 x 30 for ran- 
domly missing entries with different amounts of missing data. The methods are 
CP-WOPT (red), INDAFAC (green), and EM-ALS (blue). We plot cumula- 
tive results for multiple starts. The first start is based on the n-mode singular 
vectors, and the remaining starts are random. 
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larger value of p) , nearly all problems are solved to high accuracy by all meth- 
ods. Since this is a nonlinear problem, p does not tell the entire story, but it is 
at least a partial indicator of problem difficulty. 

The differentiator between the methods is computational time required. Fig- 
ure 5 shows a comparison of the sparse and dense versions of CPWOPT, along 
with INDAFAC and EM-ALS. The timings reported in the figure are the sum 
of the times for all starting points. The y-axis is time in seconds on a log scale. 
We present results for time rather than iterations because the iterations for 
INDAFAC are much more expensive than those for CPWOPT or EM-ALS. We 
make several observations. For missing data levels up to 80%, the dense version 
of CPWOPT is faster than INDAFAC, but EM-ALS is the fastest overall. For 
90% and 95% missing data, the sparse version of CPWOPT is fastest (by more 
than a factor of 10 in some cases) with the exception of the case of 90% missing 
data for problems of size 150 x 120 x 90. However, for this latter case, the dif- 
ference is not significant; furthermore, there are two problems where EM-ALS 
required more than 10 times the median to find a correct factorization. 

As demonstrated in earlier studies [1, 7], problems with structured missing 
data are generally more difficult than those with randomly missing values. We 
present results for problems of structured missing data in Appendix B. The 
set-up is the same as we have used here, except that we only go up to 90% 
missing data because the accuracy of all methods degrades significantly after 
that point. The results indicate that accuracies of all methods start to diminish 
at lower percentages of structured missing data than for those problems with 
randomly missing data. Timings of the methods also indicate similar behavior 
in the case of structure missing data: With 80% or less missing data EM-ALS is 
fastest, whereas the sparse version of CP-WOPT is in general faster when there 
is 90% missing data. 

Because it ignores missing values (random or structured) , a major advantage 
of CP-WOPT is its ability to perform sparse computations, enabling it to scale 
to very large problems. Neither INDAFAC nor EM-ALS can do this. The 
difficulty with EM-ALS is that it must impute all missing values; therefore, it 
loses any speed or scalability advantage of sparsity since it ultimately operates 
on dense data. Although INDAFAC also ignores missing values, its scalability 
is limited due to the expense of solving the Newton equation. Even though the 
Newton system is extremely sparse, we can see that the method is expensive 
even for moderate-sized problems such as those discussed in this section. In the 
next section, we consider very large problems, demonstrating this strength of 
CP-WOPT. 

5.4- Large-scale simulated data 

A unique feature of CP-WOPT is that it can be applied to problems that 
are too large to fit in memory if dense storage were used. We consider two 
situations: 

(a) 500 X 500 x 500 with 99% missing data (1.25 million known values), and 

(b) 1000 X 1000 X 1000 with 99.5% missing data (5 million known values). 
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Figure 5: Comparison of timings for randomly missing entries for the total time 
for all starting points for different problem sizes and different amounts of missing 
data. The methods are CP-WOPT-Dense (red), CP-WOPT-Sparse (magenta), 
INDAFAC (green), and EM-ALS (blue). 
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5. 4-1- Generating large-scale simulated data 

The method for generating test problems of this size is necessarily differ- 
ent than the one for smaller problems presented in the previous section be- 
cause we cannot, for example, generate a full noise tensor. In fact, the tensors 
y = |A, B, C] (the noise- free data) and Tsf (the noise) are never explicitly fully 
formed for the large problems studied here. 

Let the tensor size he I x J x K and the missing value rate be M. We 
generate the factor matrices as described in §5.3, using i? = 5 components. 
Next, we create the set I with (1 — M)IJK randomly generated indices; this 
set represents the indices of the known (non-missing) values in our tests. The 
binary indicator tensor W is stored as a sparse tensor [28] and defined by 

Ji if(z,i,fc)ez, 

I otherwise. 

Rather than explicitly forming all of 'y = JA, B, C], we only calculate its values 
for those indices in T. This is analogous to the calculations described for Step 2 
of Figure 3b. All missing entries of y are set to zero, and y is stored as a sparse 
tensor. Finally, we set 

where 3\f is a sparse noise tensor such that 

_/aA(0,1) if(i,j,fc)el, 
I otherwise. 

For each problem size, ten independent test problems are generated. The initial 
guess is generated by computing the n-mode singular vectors of the tensor with 
missing values filled in by zeros. 

5.4-2. Results for CP-WOPT on large-scale data 

The computational set-up was the same as that described in §5.1 except that 
the tolerance for the two-norm of the gradient divided by the number of entries 
in the gradient was set to 10"^*^ (previously 10~^). The results across all ten 
runs for each problem size are shown in Figure 6. 

In the 500 x 500 x 500 case, storing a dense version of the tensor (assuming 8 
bytes per entry) would require exactly 1GB. Storing just 1% of the data (1.25M 
entries) in sparse format (assuming 32 bytes per entry, 8 for the value and 8 for 
each of the indices) requires only 40MB. In our randomly generated experiments, 
all ten problems were solved with an FMS score greater than 0.99, with solve 
times ranging between 200 and 1000 seconds. 

In the 1000 x 1000 x 1000 case, storing a dense version of the tensor would re- 
quire 8GB. Storing just 0.5% of the data as a sparse tensor (5M entries) requires 
160MB. In our randomly generated experiments, only nine of the ten problems 
were solved with an FMS score greater than 0.99. The solve times ranged from 
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Figure 6: Results of CP-WOPT-Sparse for large-scale problems with at least 
99% missing data. The means are shown as solid lines. 



2000 to 10000 seconds, approximately 10 times slower than the 500 x 500 x 500 
case, which had half as many variables and 1/4 the nonzero entries. In the 
failed case, the optimization method exited because the relative change in the 
function tolerance was smaller than the tolerance. We restarted the optimiza- 
tion method with no changes except that we use the solution that had been 
computed previously as the initial guess. After 985 seconds of additional work, 
an answer was found with an FMS score of 0.9999. 

5.5. EEG data 

In this section, we demonstrate the use of CP-WOPT in multi-channel EEG 
(electroencephalogram) analysis by capturing the underlying brain dynamics 
even in the presence of missing signals. We use an EEG data set collected to 
observe the gamma activation during proprioceptive stimuli of left and right 
hand [19]. The data set contains multi-channel signals (64 channels) recorded 
from 14 subjects during stimulation of left and right hand (i.e., 28 measurements 
in total). For each measurement, the signal from each channel is represented in 
both time and frequency domains using a continuous wavelet transform and then 
vectorized (forming a vector of length 4392); in other words, each measurement 
is represented by a channels by time-frequency matrix. The data for all mea- 
surements are then arranged as a channels by time-frequency by measurements 
tensor of size 64 x 4392 x 28. For details about the data, see [19]. 

We model the data using a CP model with R — i, denoting A, B, and C as 
the extracted factor matrices corresponding to the channels, time- frequency, and 
measurements modes, respectively. We demonstrate the columns of the factor 
matrices in each mode in Figure 7a. The 3-D head plots correspond to the 
columns of A, i.e., coefficients corresponding to the channels ranging from low 
values in blue to high values in red. The time-frequency domain representations 
correspond to the columns of B rearranged as a matrix and again ranging from 
low values in blue to high values in red. The bar plots represent the columns 
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of C. Note that the three rows of images in Figure 7a (3-D head plot, matrix 
plot, bar plot) correspond to columns r = 1,2,3 of the factor matrices (A, B, 
C), respectively. Observe that the first row of images highlights the differences 
between left and right hand stimulation while the second and third rows of 
images pertain to frontal and parietal activations that are shared by the stimuli. 
Unlike [19], we do not use nonnegativity constraints; we convert tensor entries 
from complex to real values by using the absolute values of the entries and center 
the data across the channels mode before the analysis. 
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Figure 7: Columns of the CP factor matrices (A, B and C with R = 3) extracted 
from the EEG data arranged as a channels by time-frequency by measurements 
tensor. The 3-D head images were drawn using EEGLab [35]. 

It is not uncommon in EEG analysis that the signals from some channels are 
ignored due to malfunctioning of the electrodes. This corresponds to missing 
fibers in the tensor (as in Figure 1) when we arrange the data as described above. 
To reflect such cases of missing data, we randomly set data for one or more of 
the 64 channels for each measurement to be missing, center the tensor across 
the channels mode ignoring the missing entries, and then fit a CP model with 
i? = 3 to the resulting data using CP-WOPT. Let A, B, C be the factor matrices 
extracted from a tensor with missing entries using the CP-WOPT algorithm. 

Table 1 presents the average EMS scores (over 50 instances of the data 
with randomly missing channels) for different numbers of missing channels per 
measurement. As the number of missing channels increases, the average EMS 
scores decrease as expected. However, even with up to 30 missing channels per 
measurement, or about 47% of the data, the extracted factor matrices match 
with the original factor matrices well, with average EMS scores around 0.90. 
Figure 7b presents the images for A, B and C corresponding to those for A, B, C 
in Figure 7a. We see that the underlying brain dynamics are still captured even 
when 30 channels per measurement are missing; only slight local differences 
can be observed between the original data and the model generated using CP- 
WOPT. 

It can be argued that the activations of the electrodes are highly correlated 
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and even if some of the electrodes are removed, the underlying brain dynamics 
can still be captured. However, in these experiments we do not set the same 
channels to missing for each measurement; the channels are randomly missing 
from each measurement. There are cases when a CP model may not be able to 
recover factors when data has certain patterns of missing values, e.g., missing 
the signals from the same side of the brain for all measurements. However, such 
data is not typical in practice. 

Table 1: EEG Analysis with Incomplete Data: The similarity between the 
columns of A, B, C and the columns of factor matrices extracted by CP-WOPT. 
The similarity is measured in terms of FMS defined in (9). 



Number of 
Missing Channels 


CP-WOPT 


1 


0.9959 


10 


0.9780 


20 


0.9478 


30 


0.8949 


40 


0.6459 



Table 1 shows the FMS scores for varying amounts of missing data. The 
solution docs not seriously degrade until 40 of 64 channels are removed. 

5.6. Network traffic data 

In the previous section, we were interested in tensor factorizations in the 
presence of missing data. In this section, we address the problem of recover- 
ing missing entries of a tensor; in other words, the tensor completion problem. 
One application domain where this problem is frequently encountered is com- 
puter network traffic analysis. Network traffic data consists of traffic matrices 
(TM), which record the amount of network data exchanged between source 
and destination pairs (computers, routers, etc.). Since TMs evolve over time, 
network data can be represented as a tensor. For instance, Geant data [36] 
records the traffic exchanged between 23 routers over a period of 4 months col- 
lected using 15-minutc intervals. This data set forms a third-order tensor with 
source routers, destination routers and time modes, and each entry indicates the 
amount of traffic sent from a source to a destination during a particular time 
interval. Missing data arises due to the expense of the data collection process. 

In our study, we used the Geant data collected in April^ stored as a tensor 
of size 23 X 23 X 2756. Let 7 represent this raw data tensor. In order to adjust 



®The data was incomplete for the full four-month period represented in the data, e.g., 6 
days of data was missing at the end of February. We used a subset of the data eorrcsponding 
to a period with no missing time slices. 
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for scaling bias in the amount of traffic, we preprocess T as X = log(Cr + 1). 
Figure 8 presents the results of 2-component CP model (i.e., R = 2); the first 
and second row correspond the first and second column of the factor matrices of 
the model, respectively. This CP model fits the data well but not perfectly and 
there is some unexplained variation left in the residuals. Let % be the tensor 
constructed using the computed factor matrices. The modeling error, defined as 

'' II gt; II , is approximately 0.31 for the 2-component CP model computed. Even 
though extracting more components slightly lowers the modeling error, models 
with more components do not look appropriate for the data. 
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Figure 8: Factor matrices extracted from Geant data using a 2-component CP 
model. The first row illustrates the first column of the factor matrices in each 
mode and the second row shows the second column of the factor matrices. 

In order to assess the performance of CP-WOPT in terms of recovering 
missing data, randomly chosen entries of X are set to missing and a 2-component 
CP model is fit to the altered data. The extracted factor matrices are then 
used to reconstruct the data and fill in the missing entries. These recovered 
values are compared with the actual values based on the tensor completion 
score (TCS) defined in (10). Figure 9 presents the average TCS (across 30 
instances) for different amounts of missing data. We observe that the average 
TCS is around 0.31 when there is little missing data and increases very slowly 
as we increase the amount of missing data. The average TCS is only slightly 
higher, approximately 0.33, even when 95% of the entries are missing. However, 
the average TCS increases sharply when the amount of missing data is 99%. 
Note that even if there is no missing data, there will be completion error due 
to the modeling error, i.e., 0.31. Figure 9 demonstrates the robustness of CP- 
WOPT, illustrating that the TCS can be kept close to the level of the modeling 
error using this method, even when the amount of missing data is high. 
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We have addressed the tensor completion problem using a CP model, which 
gives easily-interpretable factors in each mode. However, for the tensor com- 
pletion problem, the recovery of the missing entries is more important than the 
underlying factors and their interpretability. Therefore, modeling the data using 
a restricted model like CP may not be the best approach in this case. It may 
be possible to achieve lower modeling error using a more flexible model such as 
a Tucker model [37]. If this model is fit to the data using algorithms akin to 
CP-WOPT, missing entries may be recovered more accurately. This is a topic 
of future research. 
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Figure 9: Tensor Completion Score for different amounts of missing data for 
Geant data when a 2-component CP model fitted using CP-WOPT is used to 
recover the missing entries. 



6. Conclusions 

The closely related problems of matrix factorizations with missing data and 
matrix completion have recently been receiving a lot of attention. In this paper, 
we consider the more general problem of tensor factorization in the presence 
of missing data, formulating the canonical tensor decomposition for incomplete 
tensors as a weighted least squares problem. Unlike imputation-based tech- 
niques, this formulation ignores the missing entries and models only the known 
data entries. We develop a scalable algorithm called CP-WOPT using gradient- 
based optimization to solve the weighted least squares formulation of the CP 
problem. The code has been released in version 2.4 of the Tensor Toolbox for 
MATLAB [29]. 

Our numerical studies suggest that the proposed CP-WOPT approach is ac- 
curate and scalable. CP-WOPT can recover the underlying factors successfully 
with large amounts of missing data, e.g., 90% missing entries for tensors of size 
50 X 40 X 30. We have also studied how CP-WOPT can scale to problems of 
larger sizes, e.g., 1000 x 1000 x 1000, and recover CP factors from large, sparse 
tensors with 99.5% missing data. Moreover, through the use of both dense and 
sparse implementations of the algorithm, we have shown that CP-WOPT was 
always faster in our studies when compared to the best alternative approach 
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based on second-order optimization (INDAFAC) and even faster than EM-ALS 
for high percentages of missing data. 

We consider the practical use of CP-WOPT algorithm in two different appli- 
cations which demonstrate the effectiveness of CP-WOPT even when there may 
be low correlation with a multi-linear model. In multi-channel EEG analysis, 
the factors extracted by the CP-WOPT algorithm can capture brain dynamics 
even if signals from some channels are missing, suggesting that practitioners 
can now make better use of incomplete data in their analyses. We note that the 
EEG data was centered by using the means of only the known entries; however, 
robust techniques for centering incomplete data are needed and is a topic of 
future investigation. In network traffic analysis, CP-WOPT algorithm can be 
used in the context of tensor completion and recover the missing network traffic 
data. 

Although 90% or more missing data may not seem practical, there are situa- 
tions where it is useful to purposely omit data. For example, if the data is very 
large, excluding some data will speed up the computation and enable it to fit in 
memory (for large-scale data like the example of a 1000 x 1000 x 1000 tensor that 
would normally require 8GB of storage). It is also common to leave out data 
for the purpose of rank determination or model assessment via cross-validation. 
For these reasons, it is useful to consider high degrees of missing data. 

In future studies, we plan to extend our results in several directions. We will 
include constraints such as non-negativity and penalties to encourage sparsity, 
which enable us to find more meaningful latent factors from large-scale sparse 
data. Finally, wc will consider the problem of collective factorizations with 
missing data, where we are jointly factoring multiple tensors with shared factors. 

Appendix A. Additional comparisons for randomly missing data 

Figures A. 10 and A. 11 show box plots of the FMS scores for problems of size 
100 X 80 X 60 and 150 x 120 x 90, respectively. See §5.3 and Figure 4 for a detailed 
description of the experimental setup and the results for analogous problems of 
size 50 X 40 X 30. We can sec from these figures that the three methods being 
compared (CP-WOPT, INDAFAC, EM-ALS) arc indistinguishable in terms of 
accuracy, as was the case for the smaller sized problems. Furthermore, compa- 
rable improvements using multiple starts can be seen for these larger problems; 
we are able to solve a majority of the problems with very high accuracy using 
five or fewer starting points (the n-mode singular vectors used as the basis for 
the first run and random points used for the others). 

Appendix B. Comparisons for structured missing data 

We use the same experimental set-up as described in §5.3, with the exception 
that we performed experiments using problems with up to 90% missing data (as 
opposed to 95%) and we only show results for three starting point (the first 
is the n-mode singular vectors and the remaining two are random). In the 
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Figure A. 10: Comparison of accuracy for problems of size 100 x 80 x 60 for 
randomly missing entries with different amounts of missing data. The methods 
are CP-WOPT (red), INDAFAC (green), and EM-ALS (blue). We plot cumu- 
lative results for multiple starts. The first start is based on the n-mode singular 
vectors, and the remaining starts are random. 
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Figure A. 11: Comparison of accuracy for problems of size 150 x 120 x 90 for 
randomly missing entries with different amounts of missing data. The methods 
are CP-WOPT (red), INDAFAC (green), and EM-ALS (blue). We plot cumu- 
lative results for multiple starts. The first start is based on the n-mode singular 
vectors, and the remaining starts are random. 
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case of randomly missing fibers, without loss of generality, we consider missing 
fibers in the third mode only. In that case, W is an J x J binary matrix with 
exactly [M/JJ randomly selected entries are set to zero, and the binary tensor 
W is created by stacking K copies of W together. We again require that every 
slice of IV (in every direction) have at least one nonzero, which is equivalent 
to requiring that W has no zero rows or columns. For each problem size and 
missing data percentage, thirty independent test problems were generated. 

The average FMS results are presented in Figures B.12, B.13 and B.14 for 
problems of sizes 50 x 40 x 30, 100 x 80 x 60 and 150 x 120 x 90, respectively. 
As for the problems with randomly missing data, we see that as the percentage 
of missing data increases, the average FMS scores tend to decrease. However, 
one notable difference is that the average FMS scores for structured missing 
data are in general lower than those for problems with comparable amounts of 
randomly missing data. This indicates that problems with structured missing 
data are more difficult than those with randomly missing data. 

The computational times required to compute the CP models using the 
different methods are present in Figure B.15, where we observe comparable 
results to those presented in Figure 5 for the problems with randomly missing 
data. With 80% or less structured missing data EM-ALS is fastest, whereas the 
sparse version of CP-WOPT is in general faster when there is more than 80% 
missing data. 
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Figure B.12: Comparison of accuracy for problems of size 50 x 40 x 30 for ran- 
domly missing fibers with different amounts of missing data. The methods are 
CP-WOPT (red), INDAFAC (green), and EM-ALS (blue). We plot cumula- 
tive results for multiple starts. The first start is based on the n-mode singular 
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